I am doing a secondary analysis of gene expression data from tumor biopsy samples in a study of patients with rectal cancer. The dataset includes basic clinical covariates and recurrence data. I will determine what gene expression is associated with recurrence, and also perform gene set enrichment analysis to determine whether any pathways are up- or down-regulated in tumors that respond to chemoradiation compared to tumors that do not respond.
The standard of care for stage III rectal cancer is combination chemoradiation followed by surgical resection of the primary tumor and lymph nodes. There is interest in developing predictive tools that can predict which patients will have a complete response to chemoradiation and could potentially avoid surgery, which would leave the patient with a colostomy bag.
Traditional clinical factors such as demographic variables, stage, etc are not sufficient to predict which patients will have good responses to chemoradiation, so investigators are focusing on radiomics and genomics to develop predictive models. These are collaborative studies between radiation oncologists, radiologists, and bioinformaticians. I met several times with Mengyuan Kan, who taught me about the RAVED pipeline she developed and how I can adapt it to my chosen data set.
I used publicly available gene expression and phenotype data from GEO. This study by Watanabe et al. obtained biopsies of the primary tumor for 46 patients with stage III rectal cancer, administered pre-operative chemoradiation, and then resected the primary tumors. Patients were labeled as “Responders” if they had a complete or near-complete pathological response according to a classification system by the Japanese Society for Cancer of the Colon and Rectum. There was no other clinical data included in the dataset.
QC and analysis scripts for differential expression were adapted from the RAVED pipeline PubMed.
This report shows the QC steps for gene expression microarry data from GEO study, including:
Manually define the variables used in the RAVED pipeline
# GEO id
geo_id="GSE35452"
GPL_ID="GPL570"
# directory stores GEO data
datadir="C:/Users/jahan/Desktop/GSE35452"
# directory stores generated files
resdir="C:/Users/jahan/Desktop/GSE35452"
#platform="Affymetrix" #Try changing to Illumina
platform="Affymetrix"
geo_GPL=""
normdata=FALSE
suppldata=TRUE
paired=FALSE
usesuppl=TRUE
# The shortname_func function shortens the sample name shown in the plots. To start, define shortname_func <- function(x){x}
shortname_func <- function(x){gsub("^(.*)\\.(cel|CEL).gz","\\1",x)}
Load the necessary libraries. Load affy and dplyr packages later since they will mask other functions.
Download GEO series matrix files if available.
library(dplyr)
pheno.raw=pData(gse)
cols <- c("title","geo_accession","source_name_ch1","response to preoperative chemoradiotherapy:ch1")
pheno <- pheno.raw %>%
dplyr::select(cols) %>%
dplyr::rename(Response="response to preoperative chemoradiotherapy:ch1") %>%
dplyr::mutate(Patient=seq(1:nrow(pheno.raw)),
Response=ifelse(Response=="Non-responder","Nonresponder","Responder"),
Disease="Rectal_cancer",
Tissue="Tumor",
Response=factor(Response),
GEO_ID=geo_accession)
pheno$Response[pheno$Response=="Non-responder"]=factor("Nonresponder")
pData(gse)=dplyr::rename(pData(gse),Response='response to preoperative chemoradiotherapy:ch1')
pData(gse)$Response[pData(gse)$Response=='Non-responder']='Nonresponder'
detach("package:dplyr")
Download supplementary raw data files.
## Loading required package: pd.hg.u133.plus.2
## Loading required package: RSQLite
## Loading required package: DBI
## Platform design info loaded.
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868548_102_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868549_K_102.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868550_K_104.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868551_105_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868552_106_Rad_2.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868553_107_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868554_108_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868555_109_Rad_2.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868556_110_Rad_2.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868557_111_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868558_112_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868559_113_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868560_114_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868561_115_RCa.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868562_116_RCa.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868563_118_RCa.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868564_120_Rad_2.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868565_121_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868566_122_Rad_2.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868567_124_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868568_126_Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868569_128Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868570_129Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868571_130Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868572_131Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868573_135Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868574_136Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868575_137Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868576_143Rad.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868577_B_149.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868578_B_153.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868579_B_158.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868580_B_164.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868581_B_166.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868582_B_168.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868583_B_170.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868584_B_173.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868585_B_175.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868586_B_179.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868587_B_181.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868588_B_183.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868589_Rad_97.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868590_Rad86.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868591_Rad89.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868592_Rad91.CEL.gz
## Reading in : C:/Users/jahan/Desktop/GSE35452/data/GSM868593_Rad93.CEL.gz
Assign phenotype data from gse object to raw data object.
Retrieve scan date information from raw.data object for batch effect adjustment. For Affymetrix data, scan date information is imported by oligo with raw data files.
Show the summary of phenotype variables and the sample size for different groups
| Tissue | Disease | Response | Count |
|---|---|---|---|
| Tumor | Rectal_cancer | Nonresponder | 22 |
| Tumor | Rectal_cancer | Responder | 24 |
| ScanDate_Group | Count |
|---|---|
| 2006 | 4 |
| 2007 | 22 |
| 2008 | 8 |
| 2009 | 12 |
Assign colors to scan date or disease/Response if scan date is not available.
If negative/zero intensity values are present, convert them to NAs.
Write the phenotype data set to .txt format for storage
The major QC steps and scoring methods for outliers were adapted from arrayQualityMetrics. The threshold to determine an outlier used in arrayQualityMetrics is the boxplot’s upper whisker, i.e. values beyond 1.5 times the interquartile range, which is also applied to our pipeline. The following QC metrics are included in a routine analysis. The QC metrics used for outlier detection are marked with an asterisk.
The log2-transformed/normalized intensity distributions of all samples (arrays) are expected to have the similar scale (i.e. the similar positions and widths of the boxes). Outlier detection is applied by computing a Kolmogorov-Smirnov statistic (Ka) between log-intensity distribution for one array and the pooled array data, where an array with a Ka beyond the upper whisker is designated as an outlier.
Compute the Kolmogorov-Smirnov statistic Ka between each array’s (i.e. sample) values (i.e. log2 transformed raw probe intensity values) and the pooled, overall distribution of the values.
## 0 outlier(s) are detected in the raw intensity metrics.
The intensity curves of all samples (arrays) are expected to have the similar shapes and ranges. Samples with deviated curves are likely to have problematic experiments. For example, high levels of background will shift an array’s distribution to the right. Lack of signal diminishes its right right tail. A bulge at the upper end of the intensity range often indicates signal saturation.
Overall RNA quality can be assessed by RNA degradation plots. In the gene expression array, each probe is represented by a probe set. Each probe set is 11-20 probes (pairs of oligos). This plot shows the average intensity of each probe across all probe sets, ordered from the 5’ to the 3’ end. It is expected that probe intensities are lower at the 5’ end of a probe set when compared to the 3’ end as RNA degradation starts from the 5’ end of a molecule. RNA which is too degraded will shows a very high slope from 5’ to 3’. Thus, the standardized slope of the RNA degradation plot serves as quantitative indicator of the RNA degradation.
There are two paired probe types: perfect match (PM) and of mismatched (MM) probes. A PM probe matches a strand of cDNA, while the corresponding MM probe differs from the PM by a change in the central nucleotide. A probe set is called present if the intensity value of PM is significantly larger than MM. However, the Affymetrix approach is under attack because between 15%-30% of the MM are greater than the PM. For some newer arrays, MM probes are not used. If the number of PMs is not equal to that of MMs, this might be a PM-only array.
If both PM and MM are present, the density curves of log2 PM and MM intensities are generated, where MM probes are expected to have smaller log2-intensity at the peak than PM probes due to their nonspecific hybridization.
MA plots allow pairwise comparison of the log-intensity of each array to a reference array and identification of intensity-dependent biases.
The y-axis of the MA-plot shows the log-ratio intensity of one array to the reference median array, which is called M (minus). M = log2(I1)-log2(I2) (I1: the intensity of the array studied; I2: the median intensity across arrays)
The x-axis indicates the average log-intensity of both arrays, which is called A (add). A = 1/2*(log2(I1)+log2(I2))
It is expected that the probe levels do not differ systematically within a group of replicates, so that the MA-plot is centered on the y-axis (y=0 or M=0) from low to high intensities.
Outlier detection is applied by computing a Hoeffding’s statistic (Da) on the joint distribution of A and M for each array, where an array with a Da >0.15 is designated as an outlier.
## 0 outliers are detected in the MA metrics.
MA plots of the samples with the 4 highest and lowest Hoeffding’s statistics.
Spatial plots show an artificial colored image of an array’s spatial distribution of intensities that indicate spatial variation in an array. Log-intensities of probes are plotted by their corresponding spatial x and y-coordinate in the array and are expected to be uniformly distributed if the array data has good quality. The rank scale is applied for plotting as it has the potential to amplify patterns that are small in amplitude but systematic within an array.
Outlier detection is applied by computing a sum of the absolute values of low frequency Fourier coefficients (Fa) across all probe sets for each array, where an array with a Fa beyond the upper whisker is designated as an outlier.
## 0 outlier(s) are detected in the spatial metrics.
Spatial distribution plots of samples with the 4 highest and lowest values of Fa. The value of Fa for each sample is shown in the panel headings. Outliers marked with * have Fa values of large scale spatial structures.
The normalized unscaled standard error (NUSE) and relative log expression (RLE) boxplots indicate probe set homogeneity in one array, where the metrics are derived from a fitted probe level model by the fitProbeLevelModel function (oligo). The RLE plots represent the distribution of the ratio between the log-intensity of a probe set and the median log-intensity of the corresponding probe set across all arrays, expected to be centered near 0, as a log scale is applied. Outlier detection is applied by computing a Kolmogorov-Smirnov statistic (Ra) between RLE distribution for one array and the pooled array data, where an array with a Ra beyond the upper whisker is designated as an outlier
Compute the Kolmogorov-Smirnov statistic Ra between each array’s (i.e. sample) values (i.e. relative log expression values) and the pooled, overall distribution of the values. Detect outliers that are deviated from the threshold.
Use boxplot_func function to plot RLE. Outliers marked with * have values centered away from 0 and/or are more spread out are potentially problematic.
The NUSE plots show the distribution of normalized standard error estimates, expected to be centered near 1. Outlier detection is applied by computing an upper hinge (Na) across all probe sets for each array, where an array with a Na beyond the upper whisker is designated as an outlier.
Compute 75% quantile Na of each array’s NUSE values Detect outliers that have larger Na deviated from the threshold.
3 outlier(s) are detected in NUSE metrics.
Use boxplot_func function to plot RLE. Outliers marked with * have values centered away from 0 and/or are more spread out are potentially problematic.
Distance between arrays is evaluated using mean absolute difference of log-intensity/normalized intensity between each pair of arrays, where the hierarchical tree between arrays is created based on the distance, which is visualized by a heatmap and dendrogram.
The distance d(ab) between two arrays a and b is computed as the mean absolute difference (L1-distance) between the data of the arrays (using the data from all probes without filtering). In the formula (the dist2 function from genefilter package), d(ab) = mean | M(ai) - M(bi) |, where M(ai) is the value of the i-th probe on the a-th array.
Outlier detection is applied by computing the sum of the distances of one array to all other arrays (Sa) (Sa=Sum(b)d(ab)), where an array with a Sa beyond the upper whisker is designated as an outlier.
## 0 outlier(s) are detected in sample distance metrics.
PCA demonstrates information of the expression dataset in a reduced number of dimensions. Clustering and PCA plots enable to assess to what extent arrays resemble each other, and whether this corresponds to the known resemblances of the samples.
| PC | Proportion of Variance (%) | Cumulative Proportion of Variance (%) |
|---|---|---|
| PC1 | 42.83 | 42.83 |
| PC2 | 10.28 | 53.11 |
| PC3 | 4.018 | 57.13 |
| PC4 | 3.234 | 60.36 |
| PC5 | 2.76 | 63.12 |
| PC6 | 2.32 | 65.44 |
| PC7 | 1.748 | 67.19 |
| PC8 | 1.581 | 68.77 |
| PC9 | 1.348 | 70.12 |
| PC10 | 1.315 | 71.44 |
PCA plots are generated using the first two principle components colored by known factors (e.g. Response/disease conditions, tissue, Patients and scan dates), visualizing similarities between arrays and these similarities’ correlation to batch effects.
The summary of outliers and detection methods
| Frequency | Method | |
|---|---|---|
| GSM868591_Rad89.CEL.gz | 1 | NUSE |
| GSM868592_Rad91.CEL.gz | 1 | NUSE |
| GSM868593_Rad93.CEL.gz | 1 | NUSE |
| GSM868554_108_Rad.CEL.gz | 1 | RLE |
| GSM868562_116_RCa.CEL.gz | 1 | RLE |
Create a new column “QC_Pass” in the phenotype file. By default, samples detected as an outlier more than twice are assigned to 0 otherwise to 1.
## 0 outlier(s) are defined.
QC identified no significant outlier samples that needed to be discarded. Description of the analysis findings will precede figures and tables below.
Mannually assign variables used in the RAVED pipeline
tissue="Tumor" # naming is rigid, should be same as the Tissue defined in QC
# reference condition
con0="Nonresponder" # naming is rigid, should be same as the Response defined in QC
# altered condition
con1="Responder" # naming is rigid, should be same as the Response defined in QC
# Response. Assign "comparison" if this column is used for DE.
Response=c(con0,con1)
# disease. Assign "comparison" if this column is used for DE.
disease="Rectal_cancer"
pheno_fn=paste0(resdir,"/",geo_id,"_Phenotype_withQC.txt")
Read in pre-prepared post-QC phenotype data
Subset phenotypes based on the comparison variables. Check variables (before and after QC if outliers are detected).
| ScanDate_Group | Status | Counts |
|---|---|---|
| 2006 | Nonresponder | 2 |
| 2007 | Nonresponder | 11 |
| 2008 | Nonresponder | 4 |
| 2009 | Nonresponder | 5 |
| 2006 | Responder | 2 |
| 2007 | Responder | 11 |
| 2008 | Responder | 4 |
| 2009 | Responder | 7 |
| GEO_ID | GSE35452 |
| Tissue | Tumor |
| App | Response |
| Disease | Rectal_cancer |
| Response | Responder |
| N_Condition0 | 22 |
| N_Condition1 | 24 |
| Total | 46 |
| Unique_ID | GSE35452_Tumor_Rectal_cancer_Responder_vs_Nonresponder |
Assign colours to status and scan date (if available)
Obtain raw.data.of.interest by subsetting raw.data based on the phenotype of interest
Normalize gene expression raw data using robust multi-array average (RMA) method
If negative/zero intensity values are present, convert them to NAs.
Fit a linear model to RMA log-intensity values, fit this model to a contrast matrix for the comparison of interest, and apply empirical Bayes smoothing to obtain more precise standard errors.
| Sample | Nonresponder | Responder |
|---|---|---|
| GSM868548_102_Rad.CEL.gz | 1 | 0 |
| GSM868549_K_102.CEL.gz | 1 | 0 |
| GSM868550_K_104.CEL.gz | 1 | 0 |
| GSM868551_105_Rad.CEL.gz | 1 | 0 |
| GSM868552_106_Rad_2.CEL.gz | 1 | 0 |
| GSM868553_107_Rad.CEL.gz | 0 | 1 |
| GSM868554_108_Rad.CEL.gz | 0 | 1 |
| GSM868555_109_Rad_2.CEL.gz | 0 | 1 |
| GSM868556_110_Rad_2.CEL.gz | 0 | 1 |
| GSM868557_111_Rad.CEL.gz | 1 | 0 |
| GSM868558_112_Rad.CEL.gz | 0 | 1 |
| GSM868559_113_Rad.CEL.gz | 0 | 1 |
| GSM868560_114_Rad.CEL.gz | 0 | 1 |
| GSM868561_115_RCa.CEL.gz | 1 | 0 |
| GSM868562_116_RCa.CEL.gz | 0 | 1 |
| GSM868563_118_RCa.CEL.gz | 1 | 0 |
| GSM868564_120_Rad_2.CEL.gz | 1 | 0 |
| GSM868565_121_Rad.CEL.gz | 1 | 0 |
| GSM868566_122_Rad_2.CEL.gz | 1 | 0 |
| GSM868567_124_Rad.CEL.gz | 0 | 1 |
| GSM868568_126_Rad.CEL.gz | 0 | 1 |
| GSM868569_128Rad.CEL.gz | 1 | 0 |
| GSM868570_129Rad.CEL.gz | 0 | 1 |
| GSM868571_130Rad.CEL.gz | 1 | 0 |
| GSM868572_131Rad.CEL.gz | 0 | 1 |
| GSM868573_135Rad.CEL.gz | 0 | 1 |
| GSM868574_136Rad.CEL.gz | 1 | 0 |
| GSM868575_137Rad.CEL.gz | 0 | 1 |
| GSM868576_143Rad.CEL.gz | 1 | 0 |
| GSM868577_B_149.CEL.gz | 0 | 1 |
| GSM868578_B_153.CEL.gz | 0 | 1 |
| GSM868579_B_158.CEL.gz | 1 | 0 |
| GSM868580_B_164.CEL.gz | 0 | 1 |
| GSM868581_B_166.CEL.gz | 0 | 1 |
| GSM868582_B_168.CEL.gz | 1 | 0 |
| GSM868583_B_170.CEL.gz | 1 | 0 |
| GSM868584_B_173.CEL.gz | 1 | 0 |
| GSM868585_B_175.CEL.gz | 0 | 1 |
| GSM868586_B_179.CEL.gz | 0 | 1 |
| GSM868587_B_181.CEL.gz | 1 | 0 |
| GSM868588_B_183.CEL.gz | 0 | 1 |
| GSM868589_Rad_97.CEL.gz | 0 | 1 |
| GSM868590_Rad86.CEL.gz | 1 | 0 |
| GSM868591_Rad89.CEL.gz | 1 | 0 |
| GSM868592_Rad91.CEL.gz | 0 | 1 |
| GSM868593_Rad93.CEL.gz | 0 | 1 |
| Response | |
|---|---|
| Nonresponder | -1 |
| Responder | 1 |
Check whether to adjust for scan date and donor. Create a full model that includes all variables and a null model that only includes the batch variable. The batch used for adjustment:
## batcherror = no
Compute F statistic p-values adjusted for batch effect. Q-values are obtained by the Benjamini-Hochberg method. If the batch and the status are correlated, assign NA to the batch adjusted p- and q-values. If there is no batch variable or only one batch, assign p- and q-values computed by limma to the batch adjusted p-values
Annotate official gene symbol to probes.
## The platform used is pd.hg.u133.plus.2
## The corresponding R annotation database package is hgu133plus2.db
After adjusting p-values for multiple testing, the differential expression analysis showed no significant genes associated with response to chemoradiation as displayed in the volcano plots, histograms, and heatmaps.
Histograms of p-value distributions
res=res[!is.na(res$SYMBOL),]
Show top 200 probes sorted by un-adjusted p-values
The boxplots for the top differentially expressed genes do show differential expression but after adjustment of the p values none of these were significant.
Create plotting function
Phylogeny of genes by Response. This shows that the top 200 genes do not show striking clustering according to response to chemoradiation.
Phylogeny of genes by Scan year further confirms that no significant batch effect was present.
Having created heatmaps using the normalized raw data, I wanted to create one based on the “matrix” data uploaded by the study authors (which is already normalized)
Fit a model using matrix data
#fit with matrix data
design.gse <- model.matrix(~ -1 + factor(gse$Response))
colnames(design.gse) <- levels(factor(gse$Response))
#Fit a linear model with limma package. Expression data linked to outcome of a design matrix model
fit.gse <- lmFit(gse, design.gse)
#Create contrast groups of interest
gse.contrast <- makeContrasts(Response=Responder - Nonresponder,
levels = design.gse)
#Get the contrasts for samples of interest
fit.gse2 <- contrasts.fit(fit.gse, gse.contrast)
#Adjust fit coefficients using an empirical Bayes moderation of standard errors
fit.gse2 <- eBayes(fit.gse2)
#Results for each hypothesis test can be extracted using
treatment_results_matrix <- topTable(fit.gse2, coef = "Response", adjust = "BH", num = 200)
col.sel=c("logFC","AveExpr","t","P.Value","adj.P.Val","B","Gene.Symbol") # select the columns of DE results
treatment_results_matrix<-treatment_results_matrix[treatment_results_matrix$Gene.Symbol!='',]
head(treatment_results_matrix[,col.sel])
## logFC AveExpr t P.Value adj.P.Val B
## 207763_at -1.7191568 0.41139719 -4.668236 2.572281e-05 0.9271418 -3.053453
## 239713_at 1.3627552 -0.23920733 4.367207 6.910754e-05 0.9271418 -3.205459
## 235498_at 1.1391716 -0.16818457 4.254527 9.947878e-05 0.9271418 -3.262589
## 1564202_at 1.1650943 -0.38503094 4.224529 1.095471e-04 0.9271418 -3.277806
## 214727_at 0.7825310 -0.04256456 4.186795 1.236280e-04 0.9271418 -3.296950
## 232077_s_at -0.8883044 -0.27810220 -4.158492 1.353300e-04 0.9271418 -3.311310
## Gene.Symbol
## 207763_at S100A5
## 239713_at CASC2
## 235498_at LRRIQ3
## 1564202_at LOC100131756
## 214727_at BRCA2
## 232077_s_at YPEL3
A heatmap using the authors’ uploaded normalized data shows almost perfect clustering according to radiation response. It is difficult to tell from their methods how exactly they normalized their data. This further confirms the utility of having raw data available to perform a high quality re-analysis.
| PC | Proportion of Variance (%) | Cumulative Proportion of Variance (%) |
|---|---|---|
| PC1 | 10.33 | 10.33 |
| PC2 | 8.994 | 19.33 |
| PC3 | 7.826 | 27.15 |
| PC4 | 7.527 | 34.68 |
| PC5 | 4.152 | 38.83 |
| PC6 | 3.388 | 42.22 |
| PC7 | 3.095 | 45.32 |
| PC8 | 2.695 | 48.01 |
| PC9 | 2.619 | 50.63 |
| PC10 | 2.459 | 53.09 |
PCA plots are generated using the first two principle components. The plot shows no significant clustering according to Response to chemoradiation.
Similar to the heatmaps, I wanted to plot a PCA based on the authors’ normalized data to compare it to my analysis of the normalized raw data.
| PC | Proportion of Variance (%) | Cumulative Proportion of Variance (%) |
|---|---|---|
| PC1 | 8.945 | 8.945 |
| PC2 | 8.019 | 16.96 |
| PC3 | 6.126 | 23.09 |
| PC4 | 4.291 | 27.38 |
| PC5 | 3.362 | 30.74 |
| PC6 | 3.15 | 33.89 |
| PC7 | 2.888 | 36.78 |
| PC8 | 2.718 | 39.5 |
| PC9 | 2.496 | 42 |
| PC10 | 2.409 | 44.4 |
PCA plots are generated using the first two principle components. Similar to the PCA of the normalized raw data, these do not show significant clustering. However this is in contrast to the heiarchical clustering and perhaps reflects the different methodology these techniques use to identify clusters.
###Gene Set Enrichment Analysis (GSEA)
The previous analyses examined the significance of individual genes. GSEA examines entire pathways and results in enrichment scores that suggest whether a pathway is upregulated, down-regulated, or not differentially expressed. Based on my subject matter knowledge of radiation oncology, before this analysis I hypothesized that I would see over-expression of cell cycle pathways and down-regulation of DNA double strand break repair in the Responding patients compared to non-responding. This is because radiotherapy exerts its anti-cancer effect partially through DNA double strand breaks, and impaired ability to repair these breaks could lead to more cell kill. It is also known that cells actively going through the cell cycle are much more radiosensitive than cells that are quiescent.
Load packages and import differential expression analysis data set generated earlier
library(fgsea)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:AnnotationDbi':
##
## select
## The following object is masked from 'package:nlme':
##
## collapse
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following object is masked from 'package:oligo':
##
## summarize
## The following objects are masked from 'package:Biostrings':
##
## collapse, intersect, setdiff, setequal, union
## The following object is masked from 'package:XVector':
##
## slice
## The following objects are masked from 'package:IRanges':
##
## collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
##
## first, intersect, rename, setdiff, setequal, union
## The following object is masked from 'package:Biobase':
##
## combine
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(pander)
dir="C:/Users/jahan/Documents/BMIN503_Final_Project/fgsea/"
res_gsea=read.csv("C:/Users/jahan/Desktop/GSE35452/GSE35452_Tumor_Rectal_cancer_Responder_vs_Nonresponder.csv")
res_modi <- res_gsea %>%
dplyr::mutate(adj.P.Val=P.Value, Gene=SYMBOL) %>%
dplyr::filter(!is.na(Gene)) %>% # remove probes cannot be mapped to genes
dplyr::filter(!is.na(P.Value)) %>% # remove probes with NA p.value
dplyr::arrange(Gene, P.Value,-abs(logFC)) %>% # order by gene name, p value and descending absolute logFC values
dplyr::group_by(Gene) %>% # group by gene name
dplyr::filter(row_number()==1) %>% # select first row in each gene
dplyr::ungroup() %>%
dplyr::rename(stat=t) %>% # rename t value to stat
dplyr::select(Gene,stat) %>% # select column names
dplyr::arrange(stat) %>%
as.data.frame()
gene_stat <- res_modi$stat
names(gene_stat) <- res_modi $Gene
pathways.msigkeggreac=readRDS(file=paste0(dir,"/","MSigkeggreactome_list.RDS"))
Create FGSEA data set
fgsea_res=fgsea(pathways= pathways.msigkeggreac,stats= gene_stat,minSize=15,maxSize=500,nperm=10000,gseaParam=1)
fgsea_res <- fgsea_res[order(-NES),]
collapsedPathways <- collapsePathways(fgseaRes= fgsea_res, pathways= pathways.msigkeggreac, stats= gene_stat, gseaParam=1)
mainPathways = collapsedPathways$mainPathways
fgsea_res$enriched=ifelse(fgsea_res$NES>0,"Up-regulated","Down-regulated")
fgsea_res_collapse=fgsea_res[fgsea_res$pathway %in% mainPathways,]
Plot lollipop chart of main Pathways in FGSEA. This shows that cell cycle pathways are indeed up-regulated, but contrary to my hypothesis DNA repair pathways are also up-regulated. After examining these results my interpretation is that DNA repair pathways are up-regulated because the cells in Responding patients are more actively cycling than cells in Nonresponding patients. Thus their tumors are radiosensitive because they are cycling and this is in spite of the fact that they are overexpressing DNA repair pathways.
Among pathways that are down-regulated, I don’t see any that to my knowledge are associated with response to radiotherapy.
ggplot(fgsea_res_collapse, aes(x=NES,y=reorder(pathway,NES),color=enriched)) +
geom_segment(aes(x = 0, y=reorder(pathway,NES), xend = NES, yend = pathway), color = "grey50") +
geom_point() +xlab("Normalized Enrichment Score") + ylab("REACTOME/KEGG Pathway")
Plot enrichment pathways
plotEnrichment(pathway = pathways.msigkeggreac[["REACTOME_DNA_DOUBLE_STRAND_BREAK_REPAIR"]], stats = gene_stat) + labs(title="REACTOME_DNA_DOUBLE_STRAND_BREAK_REPAIR")
plotEnrichment(pathway = pathways.msigkeggreac[["REACTOME_DNA_REPLICATION"]], stats = gene_stat) + labs(title="REACTOME_DNA_REPLICATION")
plotEnrichment(pathway = pathways.msigkeggreac[["KEGG_CELL_CYCLE"]], stats = gene_stat) + labs(title="KEGG_CELL_CYCLE")
I analyzed gene expression array data to determine whether any genes were differentially expressed in patients who responded to chemoradiation compared to those who did not respond. I was not able to identify any individual genes that were differentially expressed after adjustment for multiple testing. However on gene set enrichment analysis I identified several pathways that were up-regulated in Responders compared to Non-responders. The up-regulated pathways that were most striking were for cell cycle and DNA repair pathways. Cells that are actively cycling are known to be more radioresponsive. Cells with deficient DNA repair are also more radiosensitive, and I believe that DNA repair pathways were upregulated in Responders because their tumors were actively cycling and thus required DNA repair machinery to be available.
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale: LC_COLLATE=English_United States.1252, LC_CTYPE=English_United States.1252, LC_MONETARY=English_United States.1252, LC_NUMERIC=C and LC_TIME=English_United States.1252
attached base packages: stats4, parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: dplyr(v.0.8.3), hgu133plus2.db(v.3.2.3), org.Hs.eg.db(v.3.8.2), hgu133plus2cdf(v.2.18.0), pd.hg.u133.plus.2(v.3.12.0), DBI(v.1.0.0), RSQLite(v.2.1.2), fgsea(v.1.10.1), Rcpp(v.1.0.3), DT(v.0.10), annotate(v.1.62.0), XML(v.3.98-1.20), AnnotationDbi(v.1.46.1), sva(v.3.32.1), BiocParallel(v.1.18.1), genefilter(v.1.66.0), mgcv(v.1.8-28), nlme(v.3.1-140), limma(v.3.40.6), pander(v.0.6.3), preprocessCore(v.1.46.0), devtools(v.2.2.1), usethis(v.1.5.1), Hmisc(v.4.3-0), Formula(v.1.2-3), survival(v.2.44-1.1), lattice(v.0.20-38), gplots(v.3.0.1.1), ggplot2(v.3.2.1), viridis(v.0.5.1), viridisLite(v.0.3.0), oligo(v.1.48.0), Biostrings(v.2.52.0), XVector(v.0.24.0), IRanges(v.2.18.3), S4Vectors(v.0.22.1), oligoClasses(v.1.46.0), knitr(v.1.26), GEOquery(v.2.52.0), Biobase(v.2.44.0) and BiocGenerics(v.0.30.0)
loaded via a namespace (and not attached): snow(v.0.4-3), backports(v.1.1.5), fastmatch(v.1.1-0), lazyeval(v.0.2.2), splines(v.3.6.1), crosstalk(v.1.0.0), GenomeInfoDb(v.1.20.0), digest(v.0.6.22), foreach(v.1.4.7), htmltools(v.0.4.0), gdata(v.2.18.0), magrittr(v.1.5), checkmate(v.1.9.4), memoise(v.1.1.0), cluster(v.2.1.0), remotes(v.2.1.0), readr(v.1.3.1), matrixStats(v.0.55.0), prettyunits(v.1.0.2), colorspace(v.1.4-1), blob(v.1.2.0), xfun(v.0.11), jsonlite(v.1.6), callr(v.3.3.2), crayon(v.1.3.4), RCurl(v.1.95-4.12), zeallot(v.0.1.0), iterators(v.1.0.12), glue(v.1.3.1), gtable(v.0.3.0), zlibbioc(v.1.30.0), DelayedArray(v.0.10.0), pkgbuild(v.1.0.6), scales(v.1.0.0), xtable(v.1.8-4), htmlTable(v.1.13.2), foreign(v.0.8-71), bit(v.1.1-14), htmlwidgets(v.1.5.1), RColorBrewer(v.1.1-2), acepack(v.1.4.1), ellipsis(v.0.3.0), ff(v.2.2-14), pkgconfig(v.2.0.3), nnet(v.7.3-12), later(v.1.0.0), tidyselect(v.0.2.5), labeling(v.0.3), rlang(v.0.4.1), munsell(v.0.5.0), tools(v.3.6.1), cli(v.1.1.0), fastmap(v.1.0.1), evaluate(v.0.14), stringr(v.1.4.0), yaml(v.2.2.0), processx(v.3.4.1), bit64(v.0.9-7), fs(v.1.3.1), caTools(v.1.17.1.2), purrr(v.0.3.3), mime(v.0.7), xml2(v.1.2.2), compiler(v.3.6.1), rstudioapi(v.0.10), testthat(v.2.3.0), affyio(v.1.54.0), tibble(v.2.1.3), stringi(v.1.4.3), ps(v.1.3.0), desc(v.1.2.0), Matrix(v.1.2-17), vctrs(v.0.2.0), pillar(v.1.4.2), lifecycle(v.0.1.0), BiocManager(v.1.30.10), data.table(v.1.12.6), bitops(v.1.0-6), httpuv(v.1.5.2), GenomicRanges(v.1.36.1), R6(v.2.4.1), latticeExtra(v.0.6-28), promises(v.1.1.0), KernSmooth(v.2.23-15), gridExtra(v.2.3), affxparser(v.1.56.0), sessioninfo(v.1.1.1), codetools(v.0.2-16), gtools(v.3.8.1), assertthat(v.0.2.1), pkgload(v.1.0.2), SummarizedExperiment(v.1.14.1), rprojroot(v.1.3-2), withr(v.2.1.2), GenomeInfoDbData(v.1.2.1), hms(v.0.5.2), grid(v.3.6.1), rpart(v.4.1-15), tidyr(v.1.0.0), rmarkdown(v.1.17), shiny(v.1.4.0) and base64enc(v.0.1-3)